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Abstract 

We present performance measurements of direct gravitational iV-body simulation on 
the grid, with and without specialized (GRAPE-6) hardware. Our inter-continental 
virtual organization consists of three sites, one in Tokyo, one in Philadelphia and 
one in Amsterdam. We run simulations with up to 196608 particles for a variety 
of topologies. In many cases, high performance simulations over the entire planet 
are dominated by network bandwidth rather than latency. With this global grid 
of GRAPEs our calculation time remains dominated by communication over the 
entire range of N, which was limited due to the use of three sites. Increasing the 
number of particles will result in a more efficient execution. Based on these timings 
we construct and calibrate a model to predict the performance of our simulation on 
any grid infrastructure with or without GRAPE. We apply this model to predict the 
simulation performance on the Netherlands DAS-3 wide area computer. Equipping 
the DAS-3 with GRAPE-6Af hardware would achieve break-even between calcula- 
tion and communication at a few million particles, resulting in a compute time of 
just over ten hours for 1 iV-body time unit. 

Key words: high-performance computing, grid, N-body simulation, performance 
modelling 



1 Introduction 



Star c lusters a r e ofte n simulated by means of direct-method iV-body simula- 



tions (lAarsethl . Il985l ). The Newtonian gravitational force on individual stars 
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in such simulations is calculated by aggregating the force contributions from 
all other particles in the system. 



To enable faster execution of t hese simulations, specia lized solutions such 



as GRAvity PipEs (GRAPEs) flFukushige et all I2005T). Graphics P r ocess- 



ing Units (GPUs) ([Portegies Zwart et all 120071 ; lHamada and Iitakal . 12007 



Belleman et al.l . l2007l ) and Field-Programmable Gate Arrays (FPGAs) (ILienhart et al 



20021 ) have been successfully developed and applied. These solutions are de- 
signed or tuned specifically for optimizing force calculations, and provide dra- 
matic speedup. For example, the GRAPE-6Af features a dedicated hardware 
implementation that can calculate 42 force interactions simultaneously and 
with increased efficiency As a result, the GRAP E is able to perform force 
calculations ~ 130 times faster than a single PC (IMakino et all l2003al ). Re- 
cently, GPUs have shown gains in speed and flexibility, and they are now 
used for simulating self gravitating systems at speeds comparable to GRAPE 



( IPortegies Zwart et all 120071 : lHamada and Iitakal . 120071 : iBelleman et all 120071 ) 



Parallelization of GRAPEs appears t o be an efficient way to reduce the wall- 



clock time for individual simulations (IMakino et all l2003bl ; iGualandris et al. 



20071 ; lHarfst et all 120071 ) . The gravitational A^-body problem has calculation 
time complexity 0(N 2 ), whereas the communication scales only with O(N). 
For sufficiently large N, the force calculation time will therefore overtake the 
communication time. For a local cluster of GRAPEs with low-latency and 
high bandwidth networ k, break-even betw een calculation and communication 



is reached at JV ~ 10 4 (lHarfst et all . |2007|). 



Generally, GRAPE clusters are not cheap and few institutions can afford such 
dedicated hardware solutions. Still, more than 500 GRAPE modules, where 
one module is equivalent to one GRAPE-6Af, or 4 GRAPE-6 chips, are cur- 
rently in use across 37 institutions in 12 countries world-wide. An alternative 
to purchasing a large GRAPE-6 or GPU cluster is provided by a computa- 
tional grid. In a grid, several institutions assemble in a virtual organization, 
within which they shar e resources, and th e costs for purchasing and main- 
taining these resources (IFoster et all 120011 ). Grid middleware provides a se- 
cure wide area computing environment without requiring users to register for 
individual cl usters. In add i tion, grid-enabled MPI implementations , such as 
MPICH-G2 flKaronis et alll2002f ) or OpenMPI flGraham et al.l . l2006h . provide 
the ability to run MPI jobs across sites in the grid, using the existing MPI 
standards. Applying such grid technology to clusters of GPUs is an attrac- 
tive option, because there are a large number of (frequently idle) GPUs in 
consumer machines. By connecting these consumer machines to the grid (as 
was done in a similar f ashion with regular CPUs for the SETI@home project 
( lAnderson et all 120021 )) and using them for parallel N-body simulations, we 
can increase the computational power of the grid in a cheap and convenient 
manner. 



2 



Although there is a clear benefit of using grid technology in sharing financial 
burden, the real challenge is to develop new applications for astronomical 
problems that have yet to be solved. For example, the simulation of an entire 
galaxy, requires at least a few PFLOP/s of co mputational power a nd the 
development of a hybrid simulation environment (IHoekstra et all 120071 ). Such 
an environment performs several astrophysical simulations on vastly different 
temporal and spatial scales. For example, a hybrid simulation environment 
could consist of a stellar evolution simulation to track how individual stars 



evolve over time, a smoothed particle hydrodynamics simulation (iMonaghanl . 
19921 ) to simulate stellar collisions or close encounters, and a direct-method 
N-body calculation to simulate the remaining dynamics between stars. 



To facilitate these tightly-coupled multi-physics simulations on the PFLOP/s 
scale, it will no longer be sufficient to do high-performance computing (HPC) 
on a local cluster, as we require an extensive grid infrastructure consisting of 
several of such clusters. Although grid technology has been la rgely applied to 



facilitate high-throughput computing (lAbramson et all |2000| ) . little research 



has been done on investigating how the grid can be efficiently applied to solve 
tightly-coupled HPC problems. By using grid technology for this specific set 
of problems, we can potentially fullfill the computational requirements for 
performing petascale multi-physics simulations. 

Using a grid infrastructure for HPC has drawback, as the communication 
between grid sites dramatically increases network overhead compared to a 
local cluster. For intercontinental communication, the network latency can 
become as large as 0.3s, which is especially impractical for applications, such 
as direct-method N-body codes, that require communication over all processes 
during every iteration. Still, even for such long communication paths there 
will be a problem size (N) for which wall-clock time is dominated by the force 
calculation rather than by communication. Earlier experiments indicate that a 
grid of re gular PCs across Europ e improves overall performance for relatively 
small N ( IGualandris et all 120071 ). We address the question for which problem 
size a world-wide grid has practical usage, in particular if such a cluster is 
equipped with GPUs or GRAPEs. 



2 Experimental setup 



We have constructed a heterogeneous grid of GRAPEs, which we call the 
Global GRAPE Grid (or G3). The G3 consists of five nodes across three sites. 
Two nodes are located at Tokyo University (Tokyo, Japan), two are located 
at the University of Amsterdam (Amsterdam, the Netherlands) and one is at 
Drexel University (Philadelphia, United States). Each of the nodes is equipped 
with a GRAPE-6Af special purpose computer, which allows us to test several 
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Table 1 

Specifications for the nodes in G3. The first column gives the name of the computer 
followed by its country of residence (NL for the Netherlands, JP for Japan and US 
for the United States). The subsequent columns give the type of processor in the 
node, the amount of RAM, followed by the operating system, the kernel version and 
the version of Globus installed on the PC. Each of the nodes is equipped with a 
1 Gbit/s Ethernet card and GRAPE-6Af hardware. Local nodes are interconnected 
with Gigabit Ethernet. 



name 


location 


CPU type 


RAM 


OS 


kernel 


Globus 








[MB] 




version 


vader 


NL 


Intel P4 2.4GHz 


1280 


Ubuntu 5.10 


2.6.5 


4.0.3 


palpatine 


NL 


Intel P4 2.67GHz 


256 


RHEL 3 


2.4.21 


4.0.3 


yoda 


JP 


Athlon 64 3500+ 


1024 


FC 2 


2.6.10 


3.2.1 


skywalker 


JP 


Athlon 64 3500+ 


1024 


FC 2 


2.6.10 


3.2.1 


obi-wan 


US 


2x Xeon 3.6GHz 


2048 


Gentoo 06.1 


2.6.13 


4.0.4 



different resource topologies. Local nodes are connected by Gigabit Ethernet, 
whereas the different sites are connected with regular internet . In Table [T] we 
present the specifications of the G3. Each of the computers in the G3 is set 
up with Globus Toolkit middleware and MPICH-G20. 

In Table |2] we present the network characteristics, latency and bandwidth, 
of the connections within G3. We tested local area network (LAN) and wide 
area network (WAN) connections using the UNIX ping command to measure 
latency. We use scp for measuring the network bandwidth, transferring a 75 
MB file, rather than referring to theoretical limits because the majority of 
bandwidth on non-dedicated WANs is used by external users. For our per- 
formance measurements, we used a standard implementation of MPICH-G2 
without specific optimizations for long-distance networking. As a result, the 
MPI communication makes use of only 40%-50% of the available bandwidth 
0. If we were to enhance MPICH-G2 with additional optimizations, or add 
support for grid security to already optimized MPI libraries, such as Makino's 
tcplitE or OpenMPI, our bandwidth use would be close to the bandwidth use 
of a regular file transfer. 



The N-body integrato r we have chosen for our experiments uses b lock time- 



steps ( jMcMillanl . Il986l ) with a 4th order Hermite integration scheme ( jMakino and Aarsethl . 



19921 ). The time steps with which the particles are integrated are blocked in 
powers of two between a minimum of 2 -22 and a maximum of 2~ 3 . During each 
time step, the codes perform particle predictions, calculate forces between par- 



1 http://www.globus.org 

2 http://www3.niu.edu/mpi/, in the future: http://dev.globus.org/wiki/MPICH-G2 

3 for more information we refer to 



a 



research report from INRIA: 



http://hal .inria.fr / inria-0014941 1 /en/ 



see ]http://grape.mtk.nao.ac.jp/^makino/ softwares 
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Table 2 

Characteristics of local and wide network connections. Latency indicates the re- 
quired time for sending 1 byte through the network connection. The bandwidth 
indicates the transfer capacity of the network connection. The bandwidth was mea- 



sured with a 75MB scp file transfer. 


connection 


latency 


bandwidth (theory) 


bandwidth (real) 




[ms] 


[MB/s] 


[MB/s] 


Amsterdam LAN 


0.17 


125.0 


11.0 


Tokyo LAN 


0.04 


125.0 


33.0 


Amsterdam - Tokyo WAN 


266.0 


57.0 


0.22 


Amsterdam - Phil. WAN 


104.0 


312.5 


0.56 


Philadelphia - Tokyo WAN 


188.0 


57.0 


0.32 



tides and correct particles on a block of active particles. Particle corrections 
include updates of positions and velocities, and computation of new block time 
steps of particles. For our experiments we use three implementations of a par- 
allel iV-body integrator. One of these codes runs on a single PC with and with- 
out GRAPE. Th e two others are parallelized us i ng M PI: one of these uses the 



copy algorithm (Makino 



ring algorithm (IFox et al 



2002; Dorband et al 



19881 : 1 Angus et al 



1990 



2003) and the other uses th e 



Gualandris et all 120071 ) 



The copy algorithm has smaller number of communication steps whereas the 
ring algorithm has lower memory usage on the nodes. 



We initialize the simulations using Plummer (jPlummerl . 1 1 9 1 lh spheres that 
were in virial equilibrium and performed our simulations using a softening 
parameter of 2 -8 . S ince our simulations are p erformed over one dynamical 
(N-bodj) time unit (IHeggie and Mathieul . ll986l ). the realization of the N-body 
system is not critical to the timing results. 



3 Results of grid calculations 



We have performed a number of simulations on local machines and on the 
G3, which consists of simulations lasting one N-body time unit and shorter 
simulations lasting one integration time step. We measured the full wall-clock 
execution time for the longer simulations and we profiled the shorter simula- 
tions. 



3.1 Timing results of N-body calculations 



We run the A^-body codes, discussed in §[2j on a single PC and across the net- 
work in parallel using N = 1024 to N = 65536 (a few additional calculations 
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were performed with N > 65536). The runs were performed with and without 
GRAPE. In Figs. [I] and [2] we present the results of the copy and ring algo- 
rithms. If a simulation is run multiple times with the same problem set, the 
execution time may be slightly different per run. This variation is relatively 
small, as the slowest of 4 repeated runs (using 32768 particles over two sites) 
was a factor 1.07 slower than the fastest run. The variation can be primarily 
attributed to fluctuations in the network bandwidth. 



Single PC 

The performance on a single PC (represented by the thick solid line with 
bullets in Fig{T]) is entirely dominated by force calculations, which scales as 
0(N 2 ). As the number of steps per N-body time unit increases with N, the 
execution time scales slightly worse than N 2 . 



Grid of PCs 

The performance on the G3, without using GRAPE, is given by the thin 
dashed line with triangles. For N < 24576, the performance is dominated by 
network communication. Given that p indicates the number of process es, the 
network communication scales as O (N log p) (see ( IHarfst et all 120071 )). For 
our grid-based simulation experiments without GRAPE, break-even between 
communications and force calculations is achieved around N ~ 3 • 10 4 for the 
copy algorithm (FiglTJ, and at a somewhat higher value for the ring algorithm 
(Fig 12]). For larger N, the execution time is dominated by force calculations, 
rather than network comm unication. For these high N, the grid speedup T 
(IHoekstra and Slootl . 120051 ). which is the single-site execution time divided by 
the execution time over three grid sites, increases to 1.37 for the copy and 
1.24 for the ring algorithm. As can be seen by comparing Figs. [I] and [21 the 
copy algorithm gives overall better performance than the ring algorithm. This 
can be explained by the smaller number of communication steps in the copy 
algorithm. 



Single PC with GRAPE 

The performance on a single PC with GRAPE is dominated by force calcula- 
tions, although communication between host and GRAPE, and operations on 
the host machine have an impact on performance for N < 16384. In addition, 
the GRAPE performs less efficiently for low N, because many blocks are too 
small to fill the GRAPE pipelines. For larger N, force calculations become the 
performance bottleneck, and the scaling of the execution time becomes that 
of a single PC without GRAPE. 

Grid of PCs with GRAPE 

The performance on the G3 (with GRAPEs) using all three sites is given by the 
thin solid line with triangles. For all problem sizes N we have measured, the 
grid speedup T is less than 0.15, indicating that the performance is dominated 
by network communication. The network communication time scales better 
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than the force calculation time, therefore, force calculation time will overtake 
the network communication time if N is sufficiently large. However, this break- 
even point lies at much higher N than for a Grid of PCs, because the use of 
GRAPE greatly decreases the time spent on force calculations. 

For the copy algorithm (see Fig. [1]), calculations between Tokyo and Philadel- 
phia take less time than calculations between Amsterdam and Tokyo, due to 
a lower network latency (see Table [2]). The calculations across three sites take 
more time than calculations across two sites. This is caused by the latency of 
all-to-all MPI communications in the copy algorithm, which scales with the 
number of processors. 

According to our profiling measurements in Fig. [31 for N < 12288, a simulation 
on the G3 with GRAPEs using ring algorithm spends most of its time in net- 
work latency. For larger N more time is spent on using the network bandwidth. 
These results indicate that network bandwidth is the primary bottleneck for 
our simulations on the G3 using ring algorithm. When we compare the results 
of the runs on the grid with GRAPEs with each other, we do not notice any 
systematic trend. The results confirm that the wall-clock time is dominated 
by using the network bandwidth, which is bottlenecked by the transpacific 
network line for all grid setups. 



3.2 Profiling of the N-body simulations 



We have chosen one parallel algorithm (ring) and one resource topology (3 
nodes on 3 sites) to profile the simulation during one integration time step. The 
block size n for every measurement was fixed using a formula for calculating 
average bloc k size (n = 0.20iV a81 ), which has been used for the same initial 
conditions in lPortegies Zwart et al.l (120071 ). During execution, we measured the 
time spent on individual tasks, such as force calculations or communication 
latency between processes. We have profiled our simulations for iV = 1024 up 
to N = 196608, using the timings measured on the process running in Tokyo. 
The results of these measurements are given in Fig. [3j 



We find that for larger N, low bandwidth of our wide area network affects 
the outcome of the performance measurements, and that MPI calls are only 
able to use about a quarter of the available bandwidth for passing message 
content. For N ^ 5- 10 5 we expect the force calculation to take more time than 
network latency. If we were to use the network bandwidth more efficiently for 
such a large number of particles, the execution time would be dominated by 
force calculations. The network bandwidth can be used much more efficiently, 
either by using a more efficient MPI implementation (e.g. one that supports 
communication over multiple tcp connections) or by using a dedicated net- 
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N 

Fig. 1. The time for running the application for 1 iV-body time unit (T ap p) as a 
function of the number of stars (N) using the copy algorithm. The two thick lines 
give the results for a single CPU with GRAPE (lower solid curve) and without (top 
dashed curve). We make the general distinction between solid curves to present the 
results for simulations run with GRAPE, and dashed curves to give the results with- 
out GRAPE. The results on the grid are presented with four different lines, based 
on the three included locations. Each of these runs is performed with one node per 
site. The results for the WAN connection Philadelphia-Tokyo, Amsterdam-Tokyo 
and Amsterdam-Philadelphia-Tokyo are indicated with the solid curves with filled 
squares, open squares and filled triangles, respectively. The dashed curve with filled 
triangles give the results for the Amsterdam-Philadelphia-Tokyo connection but 
without using GRAPE. Dotted lines indicate the performance of runs with GRAPE 
according to the performance model. 
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Fig. 2. The time for running the application for 1 ./V-body time unit (T app ) as a 
function of the number of stars (N) for runs using the ring algorithm. See Fig.l for 
an explanation of the lines and symbols. 
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block size [n] 

76 133 234 410 720 1262 2213 3880 6803 11926 




N 



Fig. 3. Share of wall-clock time spent on individual tasks during a single time-step. 
Solid lines indicate tasks performed on the local machine. The thick solid line with 
filled circles represents time spent on force calculations, and the thin solid lines 
give the result for time spent on communication between PC and GRAPE (open 
triangles), particle corrections (open circles) and particle predictions (open squares) 
respectively. Dotted lines indicate time spent on communication between nodes. 
The thin dotted line with asterisks indicates time spent on communication latency 
between nodes and the thick dotted line with solid squares indicates time spent on 
using the network bandwidth. 



work. Using our current networking and MPI implementation, we expect that 
for iV 2 • 10 6 particles the force calculation time overtakes the bandwidth 
time. 



4 Modelling the performance of the grid 



In order to further understand the results and to enable performance pre- 
dictions for larger network setups, we decided to model the performance 
of the grid calculations. We model the performance o f the simulatio n by 
adopting the paral lel performance models described by iMakinol (120021 ) and 



Harfst et al.l (120071) and combinin g it with the grid performance model de- 



scribed in iGualandris et al.l (120071 ) . Further extension and calibration of the 
model allows us to simulate the performance of our N-body simulations on a 
G3 or any other topology. 
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4.1 Single PC 



An iV-body simulation over one iV-body time unit (IHeggie and Mathieul . ll986f ) 
consists of the following steps: 

(1) Read the input snapshot and initialize the N-body system. 

(2) Compute the next system time t and select the block of n active particles 
in the system. 

(3) Predict the positions and velocities of all N particles to time t. 

(4) Calculate the forces and their time derivatives between the n active par- 
ticles and all iV particles in the system. 

(5) Correct the positions, velocities and velocity derivatives of the n active 
particles, and update their time steps. 

(6) Repeat from step 2 until t has exceeded one N-body time unit. 

(7) Write the output of the simulation and terminate it. 

As relatively little time is spent on program initialization and finalization, we 
focus on the time to integrate the system (T integrate ), which consists of the 
tasks performed in steps 2 to 5. Throughout this paper we use uppercase T 
to refer to the time spent in n s t eps integration steps, and the lowercase (t) for 
the time spent in a single step. Within a single step, the total execution time 

-^integrate IS 

^■steps 

-^Integrate ^ ] (^pred tforce ^corr) > (1) 
i=l 

with the time spent on predicting particles 

^pred Tpred-N, (2) 

the time spent on calculating forces 

^forcc Tforce^--^ \ (3) 

and the time spent on correcting the active particles 

^corr 7"corr^' (4) 

Here r pre d is the time to predict a single particle, rf orcc is the time to calculate 
the forces between two particles, and r corr is the time spent to correct a single 
particle. The values for r pre d, Tf orce and r corr have been measured by using a 
sample N-body simulation with 32768 particles, and are given in tableOfor the 
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Table 3 

Machine performance specification and machine-specific constants. The first two 
columns show the name of the machine, followed by the country of residence. The 
third column indicates machine speed in Mflops, using the Whetstone benchmark. 
The last three columns give the time required for the CPU to perform one particle 
prediction (r pre d), the time required for one force calculation between two particles 
(Tforce) and the time required for correcting one particle (r cor r) respectively, all in 
microseconds. 



name 


location 


speed 


Tpred 


''"force 


''"corr 






Mflops 


M 


H 


[fjs] 


vadcr 


NL 


377 


0.247 


0.216 


4.81 


palpatine 


NL 


422 


0.273 


0.193 


2.39 


yoda 


JP 


436 


0.131 


0.110 


1.29 


skywalker 


JP 


436 


0.131 


0.110 


1.29 


obi-wan 


US 


1191 


0.098 


0.148 


1.14 



various nodes in the G3. For a more practical comparison, we also measured 
the compute speed (in floating point operations per second) for each of the 
nodes. These measurements we re carried out using the Whetstone benchmark 



( ICurnow and Wichmannl . Il976l ). 



4-2 Grid of PCs with copy algorithm 



The performance model for a single PC can be extended to include the parallel 
operation in the copy algorithm. In the copy algorithm, each process has a full 
copy of the system of N particles, but only computes the active particles in 
a specific subset of N/p particles. The result of this computation is sent to 
all other processes. We assume that all p processes have comparable speed, 
and every process has an equally sized subset of n/p active particles. For the 
copy algorithm, the host computation time (T integrate ) also consists of the time 
spent to communicate between processes (T MPI ). Therefore, 



^■steps 

-^integrate ^ ] (^pred tforce ~t~ ^corr) ~t~ ^MPIj (5) 
i=l 

A process computes forces for its subset of n/p active particles, and corrects 
only these particles. Therefore, a process requires at most Nn/p force calcu- 
lations per time step 



n 

tforce Tforee-'V , (6) 
p 
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and a process corrects at most n/p particles, of which the time spent is given 
by 



n , 

tcorr 7~corr • ( ' ) 

V 



In a parallel system, time is spent not only on integrating the system (T integrate ), 
but also on exchanging messages between processes (T MPI ). This time is ob- 
tained by adding the time spent on overcoming network latency (latency) and 
the time spent transferring particles (^bandwidth) 



n 



steps 



^~MPI — (^latency "I" ^bandwidth) • 



i=l 



In our implementation ti ate ncy is given by the sum of the latencies of each MPI 
call in the code. The copy algorithm uses 1 MPI_Allgather a nd 1 MPI_Allgath erv 



comm and (of which the latencies both scale with log 2 p ( IGualandris et al. 



20071 )) per block time-step, resulting in a total time spent on latency of 



^latency — (^MPI_Allgather + ^MPI_Allgatherv) l°g2 V- (9) 

The time used for transferring particle data is given by tbandwidth, which is 
obtained by taking the total size of the data that has to be communicated 
(which is assumed to scale with 2(p — 1) for all-to-all communications), and 
dividing it by the network bandwidth (7w)- Particles are stored in 58 byte 
data structures, resulting in 



nll6(p — 1) 

^bandwidth 

PTbw 

For a wide area computer tiatency and ^bandwidth may be quite substantial, but 
the separation in parts, as given here, enables us to optimize our network 
computer with respect to the communication characteristics. 



(10) 



4-3 Grid of PCs with ring algorithm 



Unlike the copy algor i thm, the ring algorithm (discussed in detail in lFox et al. 



(119881 ); lAngus et al.l (119901 )) does not use a single all-to-all communication 
operation, and only requires the processes to have a partial copy of size N/p 
of the system. Communication occurs in a a total of p steps (or shifts). During 
every shift, each process performs a partial force integration by calculating the 
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forces between their local subset of n/p active particles and the N/p particles 
stored in local memory. Then, each process sends its updated particles to their 
neighbor. 

To model the performance for the ring algorithm, we use the model for the 
copy algorithm and redefine the time spent on force calculations (Tf orce ) and 
the time spent to communicate between processes (^latency and ^bandwidth); 
the force calculation and the MPI communications occur in multiple shifts. 
The time spent on a partial force calculation is given by 



nN 

^force.lshift Tforce n~ ■ (H) 
P A 

The total time spent on the force calculation is given by the time for a partial 
force calculation (iforce,ishift) multiplied by the number of shifts (p), 



N 

^force 7~force?^ • (12) 
p 

For every time step, our implementation of the ring algorithm uses 2 MPI_Allreduce 
communication commands for initialization, and 1 MPI_Sendrecv operation for 
each shift. The time spent overcoming network latency is then 



" latency 



2 log 2 £^MPI_Allreduce + P^MPI_SendRecv (13) 



The ring algorithm is more bandwidth intensive than the copy algorithm, as 
all block subsets are sent and received during a ring shift, multiplying the 
time spent on transferring particles by 2p. Per particle, 58 bytes have to be 
transferred, therefore the time spent on transferring particles is given by 

bandwidth = H6n/ r bw . (14) 
44 Single PC with GRAPE 

The GRAPE-6Af is a dedicated hardware component develope d by a group of 



resea rchers led by Junichiro Makino at the University of Tokyo (IFukushige et al 



20051 ). The GRAPE-6Af is the smallest commercially available GRAPE con- 
figuration, consisting of a single GRAPE module with a peak speed of about 
123 Gflops. It calculates the forces between particles, which is the bottleneck 
in the calculation, whereas the particle predictions and corrections are still 
mostly done on the host PC. 
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When a GRAPE is used, an iV-body simulation over one iV-body time unit 
consists of the following steps: 

(1) Read the input snapshot and initialize the iV-body system. 

(2) Compute the next system time t and select the block of active particles 
n in the system. 

(3) Predict the positions and velocities of the n active particles on the pc, 
and send the predicted values and the next system time to the GRAPE. 

(4) Predict the other particles in the system on the GRAPE. 

(5) Calculate the forces and their time derivatives, using the GRAPE, be- 
tween the n active particles and all N particles in the system. 

(6) Retrieve the forces and their time derivatives from the GRAPE. 

(7) Correct the positions, velocities and velocity derivatives of the n active 
particles, and update their time steps. 

(8) Repeat from step 2 until t has exceeded one iV-body time unit. 

(9) Write the output of the simulation and terminate it. 

When using GRAPE 

^pc ^prcd "I - ^corr; (l*-*) 



where t pre d = nT prcd . The time spent to integrate particles for one iV-body 
time unit is given by 



^-steps 

-^Integrate ^ t (tpc ^grape ^comm) • (1^) 
(=1 



Here 



^grape Tpip e TlN) (1^) 

is the time for calculating the forces on the GRAPE. The time needed by 
the GRAPE to calculate the force between two particl es is given by r p \p P . The 



communication between host and GRAPE is given by (jFukushige et all 120051 ) 



t comra = 60tin + 56tfn + 72tjn. (18) 

Here the time to respectively send or receive 1 byte of data to the GRAPE 
during different steps is given by t iy t n and tj, respectively. During these steps 
respectively 60, 56 and 72 bytes per particle in the block are transferred. We 
assume that ti = tf = tj. We derive tj by measuring rcsend, which is the time 
to send one 72-byte particle to the GRAPE. Therefore, tj = TQscnd/72. By 
rewriting U, t n and tj as factors of rcsend, we can simplify the equation for 

tomm to 
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tcomm = (60 + 56 + 72) (r Gscnd /72) n. (19) 

Time spent on calculating forces on the GRAPE (t gra pc) cannot be directly 
measured by timing parts of the code, because the GRAPE force calculation 
includes some communication between host and GRAPE as well. However, we 



can derive r n 



from the total time of the force calculation as was done in 



Harfst et al.l (120071 ). Therefore, we can rewrite r pipe as 



Tpipe = -^[tforce - (116r Gs cnd/72)]. (20) 

The time spent on performing N force calculations tf orC e is given by, 

£force = TGforceA^, (21) 

where rcforce is the time spent to calculate forces between two particles. We 
then introduce the time constant (referee) in the function for T P i pe , 



Tpipc = TGforco ~ (J116r G scnd/72) —j . (22) 

Using our derived functions for tg rape and t comm , we are now able to mode l 



the performance of the GRAPE. As mentioned in (IFukushige et al.l . 120051 ). 

TGforce « 4.3 ■ lO" 10 S. 



4-5 Grid of PCs with GRAPE and copy algorithm 



When using GRAPE and a parallel algorithm, the time spent to integrate 
particles for one iV-body time unit is given by 



^-steps 

-^Integrate ^ t (^pc ^grape ^comm) -?MPI- (^^) 



We determine time spent communicating between hosts (T MPI ) using Eq.[HJ To 
determine the time spent on the host (t pc ) we use the equation for the single 
process with GRAPE (see Eq.[T5]). However, as we correct only n/p particles in 
parallel algorithms, we apply Eq.[7]to determine the time spent by the process 
on correcting particles. 

In a parallel setup of GRAPEs, each process needs to communicate and cal- 
culate forces for a subset of n/p particle in every block. We replace n by 
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n/p in our equations for t grap e as well as t comm . Therefore, the time spent on 
calculating forces is given by 



TL 

tgrape = [Nr G{orcc - (116r Gscnd /72)] -, (24) 
and communication between between the hosts and the GRAPEs becomes, 



TL 

tcomm = (60 + 56 + 72) (r Gscnd /72) -. (25) 



4-6 Grid of PCs with GRAPE and ring algorithm 



In a ring algorithm, each process computes the forces between a local set of 
n/p active particles and the local system of N/p particles during a shift. Then, 
it sends the results to its next neighbor and receives another n/p particles from 
its other neighbor. Each of the nodes spends tg ra pe,ishift calculating the forces 
for n/p particles during one shift. Before the node has integrated the force on 
all n particles, a total of p shifts have passed, resulting in a total compute 
time for this node of 

t grape PtgTSupe,lshifti 

(26) 



where the time to calculate forces for a single shift (£ gra pe,ishift) is 

tgrape.lshift = [{N T G{orcc / p) ~ ( 1 16r Gscnd /72)] - . (27) 

When GRAPE is used, particles are 74 bytes each because they contain two 
additional arrays for storing the old acceleration and old jerk. Due to this 
increased particle size, 



^bandwidth — 148n/ r bw , (28) 

whereas ti a tenc y remains unchanged. The time spent communicating between 
hosts (Thpj) is calculated as for the ring algorithm without GRAPE. 



5 Results of the performance model 



We have applied the performance model from the previous section to the re- 
sults presented in §[3j In Fig.CQwe compare the measured wall-clock time (T app ) 
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for the copy algorithm on the grid with the performance model, Fig. [2] shows 
a similar comparison for the ring algorithm. To guide the eye, the results for 
a single GRAPE are also presented in both figures. The performance model 
tracks the real measurements quite satisfactorily, giving a slightly lower com- 
putation time for a single GRAPE while giving a slightly higher computation 
time for a simulation across grid sites. 

The communication overhead of a distributed computer often renders high per- 
formance computing on a grid inefficient. However, in the N-body problem the 
compute time scales with N 2 whereas the communication scales linearly with 
N. For sufficiently large N, there will eventually be a point where relatively 
little time is lost communicating, and the compute resources are efficiently 
used. 

In figuresd] and [2] we can see that, for GRAPE-enabled simulations, break- 
even between calculation and communication is reached around N ~ 10 6 . For 
large N, a grid of two GRAPEs will outperform a single GRAPE. Our grid 
setup included three GRAPE-enabled sites. The location of these sites (Asia, 
Europe and America) were as widely distributed as physically possible. A more 
modest grid across a single continent, will perform considerably better than a 
global grid. With the performance model that we constructed in §Hl we can 
now study various grid topologies without the need to physically build the 
environment and create a virtual organisation. 



5.1 Future Prospects 

We applied the performance model to three hypothetical grids of GRAPE 
nodes. These three grids are: 1) a grid of all the available GRAPEs on the 
planet, 2) a grid of sites with more than 1 Tflops in GRAPE speed, and 3) a 
recently established Dutch grid (Dutch ASCII Computer, DAS-30) equipped 
with GRAPEs. 

Since the launch of GRAPE-6, a total of 1115 GRAPE-6 modules have been 
deployed worldwide. Japan leads the GRAPE-yard with more than 800 mod- 
ules, followed by the US (119) and Germany (62). At the moment there are 
876 GRAPE-6 modules in Asia, 132 in North America and 107 in Europe. In 
Table H]we list the sites with more than 1 Tflops peak-performance in GRAPE 
hardware and their network characteristics. The network roundtrip with the 
longest latency is a roundtrip between Japan and Ukraine, whereas the net- 
work link with the longest latency (not given in Table H|) is the transpacific 
line between Japan and the US. 



5 http://www.starplane.org/das3/ 
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Table 4 

Overview of the major GRAPE clusters (> lTflops) on planet Earth, including their 
relative network latency characteristics. The first column identifies the site, followed 
by the name of the institute, the country and the number of GRAPE modules (8 
modules provide ~ 1 Tflops peak performance.). The fifth column identifies the site 
with which the latency, given in column #6, is shortest. The one but last column 
(column #7) identifies the site with which the latency, given in column #8, is 
longest. The total number of GRAPE modules is 996. 



ID 


institution 


country 


# GRAPEs 


nearest site 
[ms] 


farthest site 
[ms] 


A 


Nat. Astronomical Obs. 


Japan 


576 


B 


2 


M 


330 


B 


Tsukuba University 


Japan 


240 


A 


2 


M 


330 


C 


AMNH 


US 


40 


G 


5 


B 


190 


D 


Astron. Rechen Inst. 


Germany 


30 


K 


2 


B 


280 


E 


Rochester Institute 


US 


26 


F 


5 


B 


170 


F 


McMaster University 


Canada 


13 


E 


5 


B 


170 


G 


Drexel University 


US 


12 


C 


5 


B 


190 


H 


Max Planck Institute 


Germany 


12 


D 


10 


B 


290 


I 


University of Amsterdam 


Netherlands 


10 


K 


7 


B 


266 


J 


Wien University 


Austria 


10 


H 


10 


B 


290 


K 


Bonn University 


Germany 


9 


D 


2 


B 


280 


L 


Cambridge University 


England 


9 


I 


10 


B 


260 


M 


Main Astronomical Obs. 


Ukraine 


9 


J 


30 


B 


330 



Organizing all the GRAPEs on the planet would be a challenging political 
problem. Organizing only the 13 largest sites would be somewhat easier, there- 
fore we included a performance prediction of such an infrastructure as well. 
Constructing a virtual organization within the 4 universities (University of 
Amsterdam, Free University of Amsterdam, Leiden University and Delft Tech- 
nical University) that participate in the DAS-3 project (and equipping the 270 
available DAS-3 nodes with specialized hardware) would be much easier than 
doing this across various countries. The Dutch DAS-3 Grid is equipped with a 
fast Myrinet-lOG internet and distributed across the Netherlands, connected 
by 10Gb light paths between clusters. The latency for the longest path is es- 
timated to be 3 ms, and we estimate the bandwidth of the connections to be 
0.5GB/s. 



In the sequential case we do not take memory limitations into account. This 
assumption is unrealistic for predicting the performance of a GRAPE, since 
N < 262144, but GPUs have a si milar performance to GRAPE , and are able 
to store up to 13 million particles (jPortegies Zwart et all 120071 ). In late 2007, 
the launch of a double-precision GPU is expected, making GPUs usable for 
production-type direct-method N-body simulations. Additionally, in recent 
years the amount of memory on GPUs has been steadily increasing, and we 
expect this trend to persist in the near future. 



We model the ring algorithm on both the G3 and on the 13 largest sites 
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(see Table @J. We adopted palpatine (see Table [T]) as the workhorse host for 
the GRAPEs and adopted the network characteristics as listed in Table HI 
The ring algorithm for our hypothetical grid experiment was assumed to be 
optimized for the use of distributed clusters of GRAPEs. The algorithm avoids 
latency intensive network links by combining communication of local clusters 
from multiple shifts. Thus, communication for the local (across node) network 
is separated from the global (across sites) communication. Finally, we assume 
that all networks between the sites have reasonable support for MPI multicast 
and gather operations, and that the latency of these operations scales with 
log 2 p 

The results of the hypothetical global GRAPE grid are presented in Fig.Hl 
Here we see that a global grid in which all GRAPEs participate outperforms a 
single GRAPE by about two orders of magnitude for N £ 10 s particles. For a 
sufficiently large number of particles (N £ 10 9 ) the total peak performance of 
the global GRAPE grid approaches about 75 Tflops. Eventually, the grid with 
all the GRAPES would outperform the grid with only the largest machines by 
about 25%, proportional to the number of GRAPEs in the two setups. When 
running simulations of N ~ 10 6 it is faster to run on three large GRAPE sites, 
than to use all the GRAPEs on the planet in parallel. 

The dashed curve in Fig. H] shows the performance of the model assuming that 
all the 270 nodes of the DAS-3 were equipped with GRAPE-6Af hardware. 
With such a setup, the maximum performance of about 35 Tflops is achieved 
for N ~ 10 7 particles. This is an interesting number for production simulations 
for astronomical research. 

In Fig. [5] we present the wall-clock time for each of the different ingredients 
of a grid calculation with GRAPEs on the DAS-3, using the performance 
model. Break-even between calculation (straight solid curve) and communica- 
tion (thick dotted curve) is achieved around N ~ 3 • 10 6 . For this large number 
of particles the communication between GRAPE and host, the predictor and 
the corrector steps require little CPU time compared to the force calculation 
on the GRAPE. For N £ 6 • 10 6 this setup would give an efficient use of the 
special processors, and high performance calculations on the grid would then 
be quite efficient. 



6 Discussion and Conclusions 

We studied the potential use of a virtual organization in which GRAPEs are 
used in a wide area grid. For this purpose, we developed a performance model 
to simulate the behavior of a grid in which each of the nodes is equipped 
with special purpose GRAPE hardware. We tested the performance model 
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,8 



N 

Fig. 4. Speedup prediction of possible GRAPE grid setups compared to a single 
CPU. The solid lines indicates execution time using 1 CPU (horizontal reference 
line) or 1 GRAPE with infinite memory (curved line). The double-dotted line in- 
dicates the predicted speedup if all 1115 GRAPEs are linked together to perform 
one simulation using an optimized ring algorithm. The dashed line indicates the 
predicted speedup if all GRAPE sites with more than 1 Tflops are linked together 
to perform one simulation using an optimized ring algorithm. The dotted line indi- 
cates the speedup if all 270 nodes in the Dutch DAS-3 grid would be equipped with 
GRAPEs. 




10° ' — — ■-' — ■ •■• '■' — ^ — ' 

10 3 10 4 10 5 10 6 10 7 10 a 

N 

Fig. 5. Predicted decomposition of performance of a DAS-3 GRAPE grid. The thick 
solid line indicates total execution time and the flat thick dashed line indicates time 
spent due to network latency. The thin dashed line indicates time spent on using 
the network bandwidth and the steep thick dotted line indicates time spent on 
calculating forces. The three bottom lines indicate time spent on communication 
between hosts and GRAPEs (upper thick dash-dotted line), correcting particles 
(middle dash-dotted line), and predicting particles (lower dotted line) 
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with an actual grid across three sites, each of which is located on a different 
continent. We used GRAPE hardware in Japan, the Netherlands and the USA 
simultaneously for calculations of 1024 up to 196608 particles. 

With these particle numbers we were able to have a better performance than 
a single computer without GRAPE. We measured a grid speedup of T ~ 1.37 
for a grid of PCs, and a grid of GRAPEs performs another ~ 4 times faster. 
On the entire range of iV we were unable to reach superior speed compared 
to a single GRAPE. However, we estimate that a small intercontinental grid 
of GRAPEs will reach superior performance for N ^ 3 ■ 10 6 particles. 

We used our grid calculations with GRAPE to construct and calibrate a per- 
formance model, with which we studied the performance of a world-wide grid 
of GRAPEs. When all the GRAPEs on the planet would participate in a vir- 
tual organization it is possible to utilize the total machine's performance, but 
only for really large systems of N ^ 10 9 . Though the total performance for 
such a setup would be about 75 Tflops, such large N would still be impractical 
to run for production astronomical simulations. 



We conclude that organizing all the major GRAPEs on the planet in a virtual 
organization is probably not worth the effort. Organizing a few of the largest 
sites with GRAPEs within one continent, however, appears politically doable 
and computationally favorable. For the DAS-3, for example, the GRAPEs 
would be used at maximum performance for a feasible number of stars. Mod- 
ern simul ations of up to about a mil lion stars have been done before using 



GRAPE f lPortegies Zwart et all 120041 ). but these calculations were performed 



on a single cluster, rather than on a grid. A grid setup as proposed here would 
allow the simulation of a few million stars within a reasonable time span. 



If we were to equip the full DAS-3 wide area computer in the Netherlands 
with GRAPEs, maximum performance would already be achieved for N ~ 
6 • 10 6 particles. Though still large, such simulations would be very doable and 
have practical applications. We estima te that r u nning a system of N = 10 6 
stars with a Salpeter mass spectrum (jSalpeterl . Il955l ) over a wide range of 
stellar masses to the moment of core collapse would take about 4 months. 
The simulation would still be mostly dominated by network latency, but the 
high-throughput networking in the DAS-3 completely removes the bandwidth 
bottleneck. 



We have mainly discussed the use of GRAPEs in a virtual organization, but 
new developments in using graphical process i ng un i ts appear to achieve s imilar 



speeds as GRAPEs (IPortegies Zwart et all 120071 ; lHamada and Iitakal . 12007 



Belleman et all 120071 ). In addition, GPUs are equipped with a larger amount 



of memory, which allows us to exploit more memory-intensive, but also faster, 
parallel algorithms. Future grids are likely to be equipped with GPUs, as the 
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GPU will become part of the standard equipment for every PC. 

Although our proof-of-concept infrastructure was of limited size, we have 
shown that it is possible to use dedicated hardware components located across 
clusters for high-performance computing. Though the current performance 
over globally connected grids leaves a lot to be desired and much optimiza- 
tion remains to be done, the concept of using dedicated hardware components 
worldwide in parallel has been shown to work. It can therefore be applied 
to solving individual tightly-coupled scientific problems or as ingredient of a 
complex multi-physics simulation, such as simulating a full galaxy, given that 
the problem size is sufficiently large to overcome the networking limitations. 
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